Google platform

Google requires large computational resources to provide its services. This article describes the technological infrastructure behind Google's websites, as presented in the company's public announcements.

Hardware

Original hardware

The original hardware (circa 1998) that was used by Google when it was located at Stanford University included:[1]

Current hardware

Servers are commodity-class x86 PCs running customized versions of Linux. The goal is to purchase CPU generations that offer the best performance per dollar, not absolute performance. How this is measured is unclear, but it is likely to incorporate the running costs of the entire server, and CPU power consumption could be a significant factor.[2] Servers as of 2009 were custom-made open-top systems containing two processors (each with an unknown number of cores or interconnected processing units), a considerable amount of RAM spread over 8 DIMM slots housing double-height DIMMs, and two SATA hard drives connected through a standard ATX-sized power supply. According to a CNET article published on April 1, 2009, each server has a novel 12-volt battery to reduce costs and improve power efficiency.[3]

Estimates of the power required for over 450,000 servers range upwards of 20 megawatts, which costs on the order of US$2 million per month in electricity charges. The combined processing power of these servers might range from 20 to 100 petaflops.[4]
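As a rough sanity check on these figures, the electricity price and per-server power draw they imply can be worked out directly. The sketch below is only back-of-the-envelope arithmetic: the 730 hours per month and the function names are assumptions made for illustration, and because 20 megawatts is a lower bound, the per-server figure is a lower bound as well.

```python
# Back-of-the-envelope arithmetic on the quoted estimates.
# Assumption: an average month has 8,760 / 12 = 730 hours.
HOURS_PER_MONTH = 730


def implied_price_per_kwh(power_mw, monthly_cost_usd, hours=HOURS_PER_MONTH):
    """Electricity price (USD/kWh) implied by a constant load and a monthly bill."""
    energy_kwh = power_mw * 1000 * hours  # MW -> kW, then kW * h = kWh
    return monthly_cost_usd / energy_kwh


def watts_per_server(power_mw, servers):
    """Average power per server if the total load is spread evenly."""
    return power_mw * 1_000_000 / servers


price = implied_price_per_kwh(20, 2_000_000)
per_server = watts_per_server(20, 450_000)
print(f"implied price: {price:.3f} USD/kWh")       # about 0.137
print(f"average draw:  {per_server:.1f} W/server")  # about 44.4
```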

Specifications:

The exact size and whereabouts of the data centers Google uses are unknown, and official figures remain intentionally vague. According to a very old estimate (from 2000, when Google was in its infancy and had one product), Google's server farm consisted of 6,000 processors and 12,000 common IDE disks (one processor and two disks per machine) at four sites: two in Silicon Valley, California and one in Virginia.[9] Each site had an OC-48 (2488 Mbit/s) internet connection and an OC-12 (622 Mbit/s) connection to other Google sites. The connections were eventually routed down to 4 × 1 Gbit/s lines connecting up to 64 racks, each rack holding 80 machines and two Ethernet switches.

Hardware details considered sensitive

In a 2008 book,[10] reporter Randall Stross wrote: "Google's executives have gone to extraordinary lengths to keep the company's hardware hidden from view. The facilities are not open to tours, not even to members of the press." He wrote this based on interviews with staff members and his experience of visiting the company.

Network topology

When a client computer attempts to connect to Google, several DNS servers resolve www.google.com into multiple IP addresses via a round-robin policy. This acts as the first level of load balancing and directs the client to different Google clusters. A Google cluster has thousands of servers, and once the client has connected to a cluster, additional load balancing is done to send the queries to the least loaded web server. This makes Google one of the largest and most complex content delivery networks.[11]
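The two levels of load balancing described above can be illustrated with a minimal sketch. It is not Google's implementation: the class names, the server names, and the addresses (taken from the 192.0.2.0/24 documentation range) are all hypothetical, and "load" is modelled simply as a count of requests already dispatched.

```python
from itertools import cycle


class RoundRobinResolver:
    """First level: hand out cluster addresses in rotation, as round-robin DNS does."""

    def __init__(self, cluster_ips):
        self._ips = cycle(cluster_ips)

    def resolve(self, hostname):
        # The hostname is ignored here; a real resolver would look it up.
        return next(self._ips)


class Cluster:
    """Second level: send each query to the currently least loaded web server."""

    def __init__(self, server_names):
        self.load = {name: 0 for name in server_names}

    def dispatch(self):
        server = min(self.load, key=self.load.get)  # ties go to the first server
        self.load[server] += 1
        return server


# Hypothetical clusters keyed by documentation-range IP addresses.
clusters = {
    "192.0.2.10": Cluster(["web-a", "web-b", "web-c"]),
    "192.0.2.20": Cluster(["web-d", "web-e"]),
}
resolver = RoundRobinResolver(list(clusters))

for _ in range(4):
    ip = resolver.resolve("www.google.com")
    print(ip, clusters[ip].dispatch())
```

Successive lookups alternate between the two clusters, and within each cluster the requests are spread across the servers with the fewest requests so far.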

Racks are custom-made and contain 40 to 80 servers (20 to 40 1U servers on either side), while new servers are 2U rackmount systems.[5] Each rack has a switch. Servers are connected via a 100 Mbit/s Ethernet link to the local switch. Switches are connected to a core gigabit switch using one or two gigabit uplinks.

Data centers

Google has numerous data centers scattered around the world. At least 12 significant Google data center installations are located in the United States. The largest known centers are located in The Dalles, Oregon; Atlanta, Georgia; Reston, Virginia; Lenoir, North Carolina; and Goose Creek, South Carolina.[12] In Europe, the largest known centers are in Eemshaven and Groningen in the Netherlands and Mons, Belgium.[12] Google's Oceania data center is claimed to be located in Sydney, Australia.[13]

Project 02

One of the larger Google data centers is located in the town of The Dalles, Oregon, on the Columbia River, approximately 80 miles from Portland. Codenamed "Project 02", the $600 million[14] complex was built in 2006 and is approximately the size of two football fields, with cooling towers four stories high.[15] The site was chosen to take advantage of inexpensive hydroelectric power, and to tap into the region's large surplus of fiber optic cable, a remnant of the dot-com boom. A blueprint of the site has appeared in print.[16]

Summa papermill

In February 2009, Stora Enso announced that it had sold the Summa paper mill in Hamina, Finland to Google for 40 million euros.[17][18] Google plans to invest 200 million euros in the site to build a data center.[19] Google chose this location because of the availability of renewable energy close by.[20]

Modular Container Data Centers

Since 2005,[21] Google has been moving to a containerized modular data center. Google filed a patent application for this technology in 2003.[22]

Software

Most of the software stack that Google uses on its servers was developed in-house.[23] It is believed that C++, Java, and Python are favored over other programming languages.[24] For example, the back-end of Gmail is written in Java and the back-end of Google Search is written in C++.[25] Google has acknowledged that Python has played an important role from the beginning, and that it continues to do so as the system grows and evolves.[26]

The software that runs the Google infrastructure includes:[27]

Google has developed several abstractions which it uses for storing most of its data:[32]

Software development practices

Most operations are read-only. When an update is required, queries are redirected to other servers, so as to simplify consistency issues. Queries are divided into sub-queries, which may be sent to different ducts in parallel, thus reducing latency.[5]
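The idea of splitting a query into sub-queries that run in parallel can be sketched as a simple scatter-gather. This is an illustrative assumption about the general pattern, not a description of Google's code: the partitions are small in-memory dictionaries, a thread pool stands in for separate machines, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def search_partition(partition, term):
    """Sub-query: IDs of the documents in one partition that contain the term."""
    return [doc_id for doc_id, text in partition.items() if term in text.split()]


def parallel_search(partitions, term):
    """Scatter the query to every partition at once, then gather and merge the hits."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_results = pool.map(search_partition, partitions, [term] * len(partitions))
    return sorted(doc_id for part in partial_results for doc_id in part)


# Hypothetical data: three partitions, each mapping document ID to text.
partitions = [
    {1: "cheap flights to oslo", 2: "oslo weather"},
    {3: "weather in paris"},
    {4: "oslo city guide"},
]
print(parallel_search(partitions, "oslo"))
```

Because the sub-queries run concurrently, the overall latency is governed by the slowest partition rather than by the sum of all of them.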

To lessen the effects of unavoidable hardware failure, software is designed to be fault tolerant. Thus, when a system goes down, data is still available on other servers, which increases reliability.

Search infrastructure

Index

Like most search engines, Google indexes documents by building a data structure known as an inverted index. Such an index returns the list of documents that contain a given query word. The index is very large due to the number of documents stored in the servers.[11]
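A toy version of an inverted index makes the idea concrete. The sketch below assumes whitespace tokenization and lower-casing, which are simplifications chosen for illustration; the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the sorted list of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def lookup(index, word):
    """Return the posting list for a query word, or an empty list if it is absent."""
    return index.get(word.lower(), [])


documents = {
    1: "Google cluster architecture",
    2: "cluster of commodity servers",
    3: "search engine architecture",
}
index = build_inverted_index(documents)
print(lookup(index, "cluster"))       # documents 1 and 2
print(lookup(index, "architecture"))  # documents 1 and 3
```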

The index is partitioned by document IDs into many pieces called shards. Each shard is replicated onto multiple servers. Initially, the index was served from hard disk drives, as is done in traditional information retrieval (IR) systems. Google dealt with the increasing volume of queries by increasing the number of replicas of each shard, and thus the number of servers. They soon found that they had enough servers to keep a copy of the whole index in main memory (although with low replication or no replication at all), and in early 2001 Google switched to an in-memory index system. This switch "radically changed many design parameters" of their search system, and allowed a big increase in throughput and a big decrease in query latency.[37]
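Partitioning by document ID and replicating each shard can be sketched as follows. The source states only that the index is partitioned by document IDs; the modulo assignment, the random replica choice, and all names below are assumptions made for illustration.

```python
import random
from collections import defaultdict


def shard_of(doc_id, num_shards):
    """Assumed assignment rule: document ID modulo the number of shards."""
    return doc_id % num_shards


def build_shards(documents, num_shards):
    """Build one small inverted index per shard, each covering only its own documents."""
    shards = [defaultdict(list) for _ in range(num_shards)]
    for doc_id in sorted(documents):
        for word in set(documents[doc_id].lower().split()):
            shards[shard_of(doc_id, num_shards)][word].append(doc_id)
    return shards


def replicate(shards, copies):
    """Keep several read-only copies of every shard, standing in for separate servers."""
    return [[dict(shard) for _ in range(copies)] for shard in shards]


def query(replicated_shards, word):
    """Ask one replica of every shard, then merge the partial posting lists."""
    hits = []
    for replicas in replicated_shards:
        shard = random.choice(replicas)  # any replica can answer a read
        hits.extend(shard.get(word.lower(), []))
    return sorted(hits)


documents = {
    0: "google search index",
    1: "in memory index",
    2: "disk based index",
    3: "search latency",
}
replicated = replicate(build_shards(documents, num_shards=2), copies=3)
print(query(replicated, "index"))
print(query(replicated, "search"))
```

Adding replicas raises the number of queries that can be served at once, while adding shards reduces the amount of index each server has to hold, which is what makes keeping a full copy in main memory feasible.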

In June 2010 Google rolled out a next-generation indexing and serving system called "Caffeine", which can continuously crawl and update the search index. Previously, Google updated its search index in batches using a series of MapReduce jobs. The index was separated into several layers, some of which were updated faster than others, and the main layer could go without an update for as long as two weeks. With Caffeine the entire index is updated incrementally on a continuous basis. Google later revealed a distributed data processing system called "Percolator",[38] which is said to be the basis of the Caffeine indexing system.[31][39]

Some details about Google's inverted index compression schemes have been made public.[37][40]
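One of the publicly described schemes is the group varint encoding named in reference 40, which packs four integers behind a single tag byte that records how many bytes each value occupies. The sketch below is a simplified illustration of that idea, not Google's code: the tag bit order (first value in the two highest bits) and little-endian value bytes are assumptions of this sketch, and the function names are hypothetical.

```python
def _byte_len(value):
    """Number of bytes (1 to 4) needed to store a value below 2**32."""
    return max(1, (value.bit_length() + 7) // 8)


def encode_group(values):
    """Encode exactly four unsigned 32-bit integers as one tag byte plus 4 to 16 bytes."""
    if len(values) != 4 or any(not 0 <= v < 2**32 for v in values):
        raise ValueError("expected four integers in the range 0 .. 2**32 - 1")
    tag = 0
    body = bytearray()
    for v in values:
        n = _byte_len(v)
        tag = (tag << 2) | (n - 1)      # two tag bits per value: length minus one
        body += v.to_bytes(n, "little")  # assumed byte order: little-endian
    return bytes([tag]) + bytes(body)


def decode_group(data):
    """Decode one group produced by encode_group back into four integers."""
    tag = data[0]
    values, pos = [], 1
    for shift in (6, 4, 2, 0):           # first value sits in the two highest bits
        n = ((tag >> shift) & 0b11) + 1
        values.append(int.from_bytes(data[pos:pos + n], "little"))
        pos += n
    return values


encoded = encode_group([1, 15, 511, 131071])
print(encoded.hex(), len(encoded))  # 8 bytes instead of 16
print(decode_group(encoded))
```

Because the decoder reads all four lengths from one tag byte, it avoids testing a continuation bit on every byte, which is the usual motivation given for this layout over a plain byte-at-a-time varint.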

Server types

Google's server infrastructure is divided into several types, each assigned to a different purpose:[11][5][41][42][43]

References

  1. ^ "Google Stanford Hardware." Stanford University (provided by Internet Archive). Retrieved on July 10, 2006.
  2. ^ Tawfik Jelassi and Albrecht Enders (2004). "Case study 16 — Google". Strategies for E-business. Pearson Education. p. 424. ISBN 0273688405. 
  3. ^ [1], April 1, 2009.
  4. ^ Google Surpasses Supercomputer Community, Unnoticed?, May 20, 2008.
  5. ^ a b c d Web Search for a Planet: The Google Cluster Architecture (Luiz André Barroso, Jeffrey Dean, Urs Hölzle)
  6. ^ Strassmann, Paul A. "A Model for the Systems Architecture of the Future." December 5, 2005. Retrieved on March 18, 2008.
  7. ^ Carr, David F. "How Google Works." Baseline Magazine. July 6, 2006. Retrieved on July 10, 2006.
  8. ^ a b Jeff Dean. (2009). Design, Lessons and Advice from Building Large Distributed Systems.
  9. ^ Hennessy, John; Patterson, David (2002). Computer Architecture: A Quantitative Approach (Third ed.). Morgan Kaufmann. ISBN 1558605967.
  10. ^ Randall Stross (2008). Planet Google. New York: Free Press. p. 61. ISBN 1-4165-4691-X. 
  11. ^ a b c Fiach Reid (2004). "Case Study: The Google search engine". Network Programming in .NET. Digital Press. pp. 251–253. ISBN 1555583156. 
  12. ^ a b Rich Miller (March 27, 2008). "Google Data Center FAQ". Data Center Knowledge. http://www.datacenterknowledge.com/archives/2008/03/27/google-data-center-faq/. Retrieved 2009-03-15. 
  13. ^ Brett Winterford (March 5, 2010). "Found: Google Australia's secret data network". ITNews. http://www.itnews.com.au/News/168772,found-google-australias-secret-data-network.aspx. Retrieved 2010-03-20. 
  14. ^ Google "The Dalles, Oregon Data Center" Retrieved on January 3, 2011.
  15. ^ Markoff, John; Hansell, Saul. "Hiding in Plain Sight, Google Seeks More Power." New York Times. June 14, 2006. Retrieved on October 15, 2008.
  16. ^ Strand, Ginger. "Google Data Center" Harper's Magazine. March 2008. Retrieved on October 15, 2008.
  17. ^ "Stora Enso divests Summa Mill premises in Finland for EUR 40 million". Stora Enso. 2009-02-12. http://www.storaenso.com/media-centre/press-releases/2009/02/Pages/stora-enso-divests-summa-mill.aspx. Retrieved 2009-02-12. 
  18. ^ "Stooora yllätys: Google ostaa Summan tehtaan" (in Finnish). Kauppalehti (Helsinki). 2009-02-12. http://www.kauppalehti.fi/5/i/talous/uutiset/etusivu/uutinen.jsp?oid=2009/02/18987. Retrieved 2009-02-12. 
  19. ^ "Google investoi 200 miljoonaa euroa Haminaan" (in Finnish). Taloussanomat (Helsinki). 2009-02-04. http://www.taloussanomat.fi/talous/2009/03/04/google-investoi-200-miljoonaa-euroa-haminaan/20095951/133. Retrieved 2009-03-15. 
  20. ^ Finland - First Choice for Siting Your Cloud Computing Data Center. Accessed 4 August 2010.
  21. ^ http://www.theregister.co.uk/2009/04/10/google_data_center_video
  22. ^ http://patft.uspto.gov/netacgi/nph-Parser?Sect2=PTO1&Sect2=HITOFF&p=1&u=/netahtml/PTO/search-bool.html&r=1&f=G&l=50&d=PALL&RefSrch=yes&Query=PN/7278273
  23. ^ Mark Levene (2005). An Introduction to Search Engines and Web Navigation. Pearson Education. p. 73. ISBN 0321306775. 
  24. ^ http://www.artima.com/weblogs/viewpost.jsp?thread=143947
  25. ^ http://panela.blog-city.com/python_at_google_greg_stein__sdforum.htm
  26. ^ http://python.org/about/quotes/
  27. ^ http://highscalability.com/google-architecture
  28. ^ a b c Andrew Fikes. Storage Architecture and Challenges. Google TechTalk. July 29, 2010.
  29. ^ Intel Corp. Seizing the Open Source Cloud Stack Opportunity. See slide "Proprietary Cloud Computing Stacks".
  30. ^ Anna Patterson - CrunchBase Profile
  31. ^ a b The Register. Google Caffeine jolts worldwide search machine
  32. ^ a b http://www.eweekeurope.co.uk/news/news-it-infrastructure/google-developing-caffeine-storage-system-1620
  33. ^ http://code.google.com/apis/protocolbuffers/docs/overview.html
  34. ^ http://labs.google.com/papers/bigtable-osdi06.pdf
  35. ^ http://www.windley.com/archives/2008/06/velocity_08_storage_at_scale.shtml
  36. ^ http://groups.google.com/group/protobuf/browse_thread/thread/ee27572aef9da70a
  37. ^ a b Jeff Dean's keynote at WSDM 2009
  38. ^ Daniel Peng, Frank Dabek. (2010). Large-scale Incremental Processing Using Distributed Transactions and Notifications. Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation.
  39. ^ The Register. Google Percolator – global search jolt sans MapReduce comedown
  40. ^ GroupVarInt encoding from Jeff's talk is also described in the following sources:
    1) U.S. patent 7068192, filed in 2004, issued in 2006;
    2) Boulos Harb, Ciprian Chelba, Jeffrey Dean, Sanjay Ghemawat. (2009). Back-Off Language Model Compression. Proceedings of Interspeech 2009, pp. 325-355.
  41. ^ Chandler Evans (2008). "Google Platform". Future of Google Earth. Madison Publishing Company. p. 299. ISBN 1419689037. 
  42. ^ Chris Sherman (2005). "How Google Works". Google Power. McGraw-Hill Professional. pp. 10–11. ISBN 0072257873. 
  43. ^ Michael Miller (2007). "How Google Works". Googlepedia. Pearson Technology Group. pp. 17–18. ISBN 078973639X. 
